By Evgenia "Jenny" Nitishinskaya and Delaney Granizo-Mackenzie
Notebook released under the Creative Commons Attribution 4.0 License.
Multiple linear regression generalizes linear regression, allowing the dependent variable to be a linear function of multiple independent variables. As before, we assume that the variable $Y$ is a linear function of $X_1,\ldots, X_k$: $$ Y_i = \beta_0 + \beta_1 X_{1i} + \ldots + \beta_k X_{ki} + \epsilon_i $$ for observations $i = 1,2,\ldots, n$. We solve for the coefficients by using the method of ordinary least-squares, trying to minimize the error $\sum_{i=1}^n \epsilon_i^2$ to find the (hyper)plane of best fit. Once we have the coefficients, we can predict values of $Y$ outside of our observations.
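Equivalently, in matrix notation, if $X$ is the $n \times (k+1)$ matrix whose first column is all ones (for the intercept) and whose remaining columns hold the observations of $X_1, \ldots, X_k$, then the ordinary least-squares estimate of the coefficient vector is $$ \hat{\beta} = (X^T X)^{-1} X^T Y $$ provided $X^T X$ is invertible, i.e. provided no regressor is an exact linear combination of the others. This is, conceptually, what the fitting routine below computes for us.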
Each coefficient $\beta_j$ tells us how much $Y_i$ will change if we change $X_j$ by one while holding all of the other independent variables constant. This lets us separate out the contributions of different effects.
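In calculus terms, $\beta_j$ is the partial derivative of the modeled value of $Y$ with respect to $X_j$: $$ \frac{\partial Y_i}{\partial X_{ji}} = \beta_j $$ a point that matters when interpreting the fitted coefficients below.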
We start by artificially constructing a $Y$ for which we know the result.
In [81]:
# Import the libraries we'll be using
import numpy as np
import statsmodels.api as sm
# If the observations are in a dataframe, you can use statsmodels.formula.api to do the regression instead
from statsmodels import regression
import matplotlib.pyplot as plt
# Construct and plot series
X1 = np.arange(100)
X2 = np.array([i**2 for i in range(100)]) + X1
Y = X1 + 2*X2
plt.plot(X1, label='X1')
plt.plot(X2, label='X2')
plt.plot(Y, label='Y')
plt.legend();
We can use the same function from statsmodels as we did for a single linear regression.
In [101]:
# Use column_stack to combine independent variables, then add a column of ones so we can fit an intercept
results = regression.linear_model.OLS(Y, sm.add_constant(np.column_stack((X1,X2)))).fit()
print 'Beta_0:', results.params[0], 'Beta_1:', results.params[1], ' Beta_2:', results.params[2]
The same care must be taken with these results as with partial derivatives. The formula for $Y$ is ostensibly $3X_1$ plus a parabola. However, the fitted coefficient of $X_1$ is 1. That is because $Y$ changes by 1 if we change $X_1$ by 1 while holding $X_2$ constant. Multiple linear regression separates out the contributions of different variables, so the coefficient of $X_1$ here is different from what we would get from a simple linear regression of $Y$ on $X_1$ alone.
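To see this, here is a minimal sketch (reusing the imports and the X1 and Y constructed above; the variable name slr_x1 is just for this illustration) that runs a simple linear regression of $Y$ on $X_1$ alone. The estimated slope comes out far from 1, because it also absorbs the effect of the omitted parabola.
In [ ]:
# Simple linear regression of Y on X1 alone, for comparison with the multiple regression above
slr_x1 = regression.linear_model.OLS(Y, sm.add_constant(X1)).fit()
print 'SLR beta of X1:', slr_x1.params[1]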
Similarly, running a simple linear regression of one security on another might give a high $\beta$. However, if we bring in a third security (like SPY, which tracks the S&P 500) as an independent variable, we may find that the apparent relationship between the first two securities is almost entirely due to them both being correlated with the S&P 500. This is useful because the S&P 500 may then be a more reliable predictor of both securities than they were of each other. It also lets us better judge whether the correlation between the two securities is significant.
In [96]:
# Load pricing data for two arbitrarily-chosen assets and SPY
start = '2014-01-01'
end = '2015-01-01'
asset1 = get_pricing('DTV', fields='price', start_date=start, end_date=end)
asset2 = get_pricing('FISV', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('SPY', fields='price', start_date=start, end_date=end)
# First, run a linear regression on the two assets
slr = regression.linear_model.OLS(asset1, sm.add_constant(asset2)).fit()
print 'SLR beta of asset2:', slr.params[1]
# Run multiple linear regression using asset2 and SPY as independent variables
mlr = regression.linear_model.OLS(asset1, sm.add_constant(np.column_stack((asset2, benchmark)))).fit()
prediction = mlr.params[0] + mlr.params[1]*asset2 + mlr.params[2]*benchmark
print 'MLR beta of asset2:', mlr.params[1], ' MLR beta of S&P 500', mlr.params[2]
# Plot the three variables along with the prediction given by the MLR
asset1.plot()
asset2.plot()
benchmark.plot()
prediction.plot(color='y')
plt.legend(bbox_to_anchor=(1,1), loc=2);
In [79]:
# Plot only the dependent variable and the prediction to get a closer look
asset1.plot()
prediction.plot(color='y');
In [102]:
mlr.summary()
Out[102]:
The validity of these statistics depends on whether or not the assumptions of the linear regression model are satisfied. These are:
- The relationship between the dependent and independent variables is linear, and the errors have mean zero.
- The independent variables are not random.
- The variance of the errors is constant across observations (homoskedasticity).
- The errors are not autocorrelated. The Durbin-Watson statistic reported in the summary checks for this; values close to 2 indicate no autocorrelation.
- The errors are normally distributed. If this does not hold, we cannot rely on some of the reported statistics, such as the F-statistic.

Multiple linear regression also requires an additional assumption:
- There is no exact linear relationship between the independent variables. Otherwise it is impossible to solve for the coefficients uniquely, since many different combinations of coefficients would fit the data equally well.

A few of these can be checked directly from the fitted model, as in the sketch below.
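As a rough sketch (reusing the mlr fit, the price series, and the imports from above), we can check residual autocorrelation with the Durbin-Watson statistic and look for near-collinearity among the regressors with variance inflation factors:
In [ ]:
# Rough checks of two of the assumptions, using the residuals and regressors from the MLR above
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Same design matrix used in the MLR: constant, asset2, benchmark
exog = sm.add_constant(np.column_stack((asset2, benchmark)))

# Durbin-Watson: values near 2 suggest no autocorrelation in the residuals
print 'Durbin-Watson:', durbin_watson(mlr.resid)

# Variance inflation factors for the two regressors (column 0 is the constant);
# large values indicate near-collinearity between the independent variables
print 'VIF asset2:', variance_inflation_factor(exog, 1)
print 'VIF SPY:', variance_inflation_factor(exog, 2)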
If we confirm that the necessary assumptions of the regression model are satisfied, we can safely use the statistics reported to analyze the fit. For example, the $R^2$ value tells us the fraction of the total variation of $Y$ that is explained by the model. When doing multiple linear regression, however, we may prefer to use adjusted $R^2$, which corrects for the small increases in $R^2$ that occur when we add more regression variables to the model, even if they are not significantly correlated with the dependent variable. Adjusted $R^2$ is defined as $$ 1 - \frac{n-1}{n-k-1}(1 - R^2) $$ where $n$ is the number of observations and $k$ is the number of independent variables. Other useful statistics include the F-statistic and the standard error of the estimate.
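For instance, here is a minimal sketch (assuming the mlr results object from above) that computes adjusted $R^2$ from the formula and compares it to the value statsmodels reports:
In [ ]:
# Compute adjusted R^2 by hand and compare to statsmodels' built-in value
n = mlr.nobs           # number of observations
k = mlr.df_model       # number of regressors, excluding the constant
adj_r_squared = 1 - (n - 1) / (n - k - 1) * (1 - mlr.rsquared)
print 'Adjusted R^2 (by hand):', adj_r_squared
print 'Adjusted R^2 (statsmodels):', mlr.rsquared_adj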